Quantum Monte Carlo

Understanding QMC algorithms

2016-07-07T12:38:00.001-05:00

I started working on QMCPACK earlier this year. As usual when diving into an unfamiliar code base, one of the first challenges is to understand how the code works. Scientific codes can take more work to comprehend because they implement complex algorithms, and one must understand the underlying algorithms in addition to the code structure. Reverse-engineering the algorithm from the code is usually challenging. Hopefully there's an associated paper that describes the algorithm. However, publications give a high-level view of the algorithm and the mathematics, but there's usually a gap between the paper and the details of the implementation.

Once the code and algorithm are understood, the next question is how to best test the code. Ideally the original implementer should leave behind code and documentation for this purpose, but usually what we have is the final code and a paper. What other kinds of artifacts (code, documentation, data, test cases, etc.) should an implementer produce to help with this?

In an attempt to address these issues, I created a repository with various Python scripts and Jupyter (IPython) notebooks covering different algorithms used in a Quantum Monte Carlo code. The repository is here: https://github.com/markdewing/qmc_algorithms

Some more benefits and goals of these scripts include

Ease of comprehension

Notebooks show the derivations and demonstrate the algorithm working in a simpler context. These aim to be a bridge between the paper or text description of an algorithm and the final production implementation.

Symbolic Representation

Represent the various equations in a more symbolic form. This allows mathematics-aware transformations, such as taking derivatives. See the Jastrow factor in Wavefunctions/Pade_Jastrow.ipynb for an example. Also the wavefunctions and expressions for the energy in Variational/Variational_Hydrogen.ipynb and Variational/Variational_Helium.ipynb.
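
As a small illustration of what the symbolic form buys, here is a minimal sketch using SymPy. The Pade form u(r) = a*r/(1 + b*r), the symbol names, and the evaluation point are assumptions chosen for illustration; they are not copied from the notebook.

```python
import sympy as sp

# Assumed Pade form of the Jastrow exponent: u(r) = a*r / (1 + b*r)
r, a, b = sp.symbols("r a b", positive=True)
u = a * r / (1 + b * r)

# Derivatives come from the CAS rather than from a hand derivation
du = sp.simplify(sp.diff(u, r))
d2u = sp.simplify(sp.diff(u, r, 2))

# Evaluate at a single point, e.g. to produce a reference value for a unit test
point = {r: 1.0, a: 0.5, b: 2.0}
du_value = float(du.subs(point))
d2u_value = float(d2u.subs(point))
```

The same symbolic expression serves both purposes: it documents the derivation and it produces numbers that a unit test can compare against.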

Testing

These scripts can be used for testing in several ways. One is to evaluate the expressions at a single point to check values in unit tests.

Eventually the symbolic or script representation should be compared against the code for a wider range of values and parameters. This is similar to property-based testing in software engineering, but there are a couple of issues to address. The first is infrastructure support. Most property-based testing frameworks focus on integer parameters, like array lengths, and not so much on floating point values. A second issue is cross-language coding. In our case, the script representation is in Python and the code to test is in C++.
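
A rough sketch of comparing over a range of floating point values, using only the standard library. Both functions here are hypothetical stand-ins: `pade_reference` plays the role of the script representation and `pade_under_test` plays the role of the production code (which in practice would be C++ reached through some binding, not shown). The sampling ranges and tolerance are assumptions.

```python
import math
import random

def pade_reference(r, a, b):
    """Straightforward reference form: u(r) = a*r / (1 + b*r)."""
    return a * r / (1.0 + b * r)

def pade_under_test(r, a, b):
    """Stand-in for the production code (an algebraically equal rewrite)."""
    return a / (1.0 / r + b)

def compare_over_range(n_samples=1000, rel_tol=1e-12, seed=42):
    """Return the inputs where the two implementations disagree."""
    rng = random.Random(seed)
    failures = []
    for _ in range(n_samples):
        r = rng.uniform(0.01, 10.0)
        a = rng.uniform(0.1, 2.0)
        b = rng.uniform(0.1, 2.0)
        if not math.isclose(pade_reference(r, a, b),
                            pade_under_test(r, a, b), rel_tol=rel_tol):
            failures.append((r, a, b))
    return failures

failures = compare_over_range()
```

Returning the failing inputs, rather than just a pass/fail flag, makes it easier to see whether disagreements cluster in some region of parameter space.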

Also for testing, simpler or slower algorithms can be used to verify more complex algorithms. For example, the repository contains a Python script for Ewald summation (LongRange/ewald_sum.py) that uses the standard break-up between short and long range pieces. QMCPACK optimizes this split between the short and long range pieces, but the result should still be the same, up to some error tolerance.

Code generation

Ultimately, code generation can be used to connect the symbolic form to the executable code. This will require a maintainable mechanism for doing the code generation (it's easy to start doing code generation, but there can be problems with scaling it and keeping it neat, clean, and understandable). The previous step - using the symbolic forms for testing - seems like an easier intermediate path, and is better suited for testing existing code.

Summary

This is an experiment in creating artifacts that enhance comprehension and testing of scientific programs. Take a look at the repository ( https://github.com/markdewing/qmc_algorithms ) and see if you find it useful.

Reduced-order models in QMC

2016-01-18T12:26:00.000-06:00

After the last post about multiscale modeling, let's explore how some of the methods and ideas might apply to improving QMC.

Possible sources:
- Reduced order models. Usually applied to time-dependent problems.
- Dimensionality reduction. From statistics and machine learning.
- Low-rank matrix approximation.
- Compressed sensing - determine a sparse sampling before performing the sampling. See this article for an application to wavefunctions.

Some of these approaches are applications to different domains but use similar underlying mathematics (e.g., apply Singular Value Decomposition (SVD) to an appropriate matrix).
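
To make the shared mathematics concrete, here is a minimal sketch of a low-rank approximation via truncated SVD using NumPy. The matrix is synthetic (random, rank 3 plus small noise) and stands in for whatever matrix would be appropriate in a given application; the sizes and the rank are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a 50x40 matrix that is exactly rank 3, then add small noise
left = rng.standard_normal((50, 3))
right = rng.standard_normal((3, 40))
matrix = left @ right + 1e-6 * rng.standard_normal((50, 40))

# Thin SVD: matrix = U @ diag(s) @ Vt, singular values sorted descending
U, s, Vt = np.linalg.svd(matrix, full_matrices=False)

# Keep only the k largest singular values
k = 3
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

relative_error = np.linalg.norm(matrix - approx) / np.linalg.norm(matrix)
```

The sharp drop in the singular values after the third one is what signals that a rank-3 model captures nearly everything in this matrix.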

The "order" in "reduced-order models" refers to the complexity of the model. Some possibilities for QMC on how the complexity might be varied include: number of electrons, number of sample points in the integral, quality of the wavefunction, and the precision of the arithmetic.

Let's look at each of these in a little more detail:

1. Number of electrons
A. Integrate out some electrons - yields pseudopotential
B. Integrate out all electrons - yields intermolecular potentials between atoms/molecules

To ultimately reduce the complexity of the model, most or all of the electrons need to be eliminated. This is also the natural decomposition at an initial look - generate potential from electronic structure calculation. However, this may be too large a jump in scales in some cases, and it may be useful to find intermediate models that still include some or all of the electrons.

2. Number of sample points in the integral
A. Subset of points from an existing set of samples. Not so useful for that particular run, but could be useful for a nearby system. Possible ways a system could be 'nearby':
- Changing parameters for VMC optimization
- Core electrons for pseudopotentials
- Weak intermolecular interactions
B. Generate new points faster or more accurately
Quasi Monte Carlo methods (the other QMC) or other low discrepancy sequences.
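
For readers unfamiliar with low discrepancy sequences, here is a small sketch comparing a Halton sequence with ordinary pseudorandom sampling on a smooth 2D test integral. The integrand is an assumption picked because its exact integral over the unit square is 1; it has nothing to do with a wavefunction.

```python
import math
import random

def van_der_corput(index, base=2):
    """Return element `index` (1-based) of the van der Corput sequence."""
    value, denom = 0.0, 1.0
    while index > 0:
        index, digit = divmod(index, base)
        denom *= base
        value += digit / denom
    return value

def integrand(x, y):
    """Smooth 2D test function whose integral over the unit square is 1."""
    return (math.pi / 2) ** 2 * math.sin(math.pi * x) * math.sin(math.pi * y)

n = 4096

# Halton points: van der Corput in base 2 for x and base 3 for y
halton_estimate = sum(
    integrand(van_der_corput(i, 2), van_der_corput(i, 3)) for i in range(1, n + 1)
) / n

# Ordinary pseudorandom Monte Carlo with the same number of points
rng = random.Random(1)
random_estimate = sum(integrand(rng.random(), rng.random()) for _ in range(n)) / n

halton_error = abs(halton_estimate - 1.0)
random_error = abs(random_estimate - 1.0)
```

For smooth, low-dimensional integrands like this one the Halton error is typically smaller than the pseudorandom error at the same number of points. Whether that advantage survives in the high-dimensional integrals of a real QMC calculation is a separate question.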

Side note on pseudopotentials. They serve two purposes:
a. Reduce the number of particles that need to be handled
b. Reduce the variance introduced by sampling the core electrons

I suspect that (b) is more important than (a), at least for small problems. Pseudopotentials complicate various QMC algorithms, so improved sampling that reduces the variance could be an effective alternative.

3. Quality of wavefunction
In general we would like to have a systematic way of controlling the quality (computational complexity vs. error) of the wavefunction. One use for less-accurate approximations to the wavefunction might be as a guide for faster generation of new sample points.

4. Precision of arithmetic
The full model is really the position of all the electrons in space. We might consider restricting the electrons to lattice points or reducing the precision in places. For examples in machine learning, see here and here.

Towards Understanding Multiscale Modeling

2015-12-31T17:01:00.000-06:00

I've been interested in multiscale modeling lately, and this post is an attempt to collect some resources and clarify some of my understanding of the topic.

Some places to start are the wikipedia entry and the scholarpedia entry. Go more in-depth with the book "Principles of Multiscale Modeling". Also this review (pdf) (From 2000, but still a good overview of an array of fields and problems).

There are some other 'multi's that may be encountered in modeling:

Multiphysics - combining multiple physical models in a single simulation. For instance, a thermal model and a mechanical model where the mechanical properties are temperature-dependent. Usually the models occur on the same length scale (vs. multiscale)
Multigrid - method of accelerating a solution and bridging the largest and smallest lengths in a single model. Typically using grids of different resolution.
Multiscale - combining models at different length (or time, energy) scales

What defines a 'scale'?

Each scale consists of a self-contained theory (or model), approximations for solution (often these approximations come to define variants of each model), and validation by experiments (carefully designed to probe at this scale, and isolate the influence of other scales).

Transitioning between scales often involves a qualitative change in the type of theory (particle-based, continuum modeled with PDE, network of coupled ODE's, etc.) The "self-contained" part usually implies a small number of input parameters. Some of this is practical - if the scale had too many inputs it would be hard to understand, hard to reason about, and hard to compute.

The simplest type of multiscale modeling is sequential (also called 'hierarchical'), where a lower level model is used to derive parameters for a higher level model. For instance, using electronic structure methods to find parameters for empirical potentials in an atomistic simulation.
The other type is concurrent modeling (in some cases called 'on the fly'), where simulations at different scales are done at the same time. In the example above, the atomistic simulation would call down to the electronic structure simulation as needed to get necessary parameters. Alternately, if it is not possible to parameterize the potential, the atomistic simulation would get forces and energies directly from the electronic structure simulation.

Multiscale in Materials

This whole process seems most developed for engineering certain metals and alloys, where it's called Integrated Computational Materials Engineering.

The book "Integrated Computational Materials Engineering for Metals: Using Multiscale Modeling to Invigorate Engineering Design with Science" has some case studies, including one on the design of a Cadillac control arm, covering the scales from electronic structure (DFT) to the final mechanical design of the arm. I found it helpful to have a specific example running the entire range.

Scales encountered for materials problems include

electronic structure
atomic motion
dislocations
voids and other microstructures
continuum description
engineering design (using FEA - Finite Element Analysis)

The Minerals, Metals and Materials Society (TMS) produced a report that covers the current state and challenges for the field: "Modeling Across Scales: A Roadmapping Study for connecting materials models across length and time scales"

Multiscale in Biology

There are a lot of directions to go when building upward from electronic structure. Another possibility is biology. This article is a good overview:
"Multiscale computational models of complex biological systems".
Scales encountered include

genetic
protein
pathway and signaling
whole cell
cell network and tissue
whole organ
multisystem and organism

(And in some cases, this could go higher to populations and ecosystems)

Multiscale in Computing

It may be instructive to compare with various scales in computing

materials (doped silicon, etc)
transistors
gates
larger logic assemblies (shift registers, adders, etc)
processors (CPU)
assembly language
low level programming language
high level programming language
operating system
applications
large-scale distributed systems

Now perhaps not every hierarchical description of abstractions should be considered 'multiscale', but it may be worth considering the similarities and differences.

One of those differences is where engineering (control) can occur. For computing, the entire stack is engineered. For materials, points of control are composition, manufacturing process, and component design. For drug design in biology, the only point of control/design is the drug molecule. All the effects above that level are no longer under direct control - understanding how these effects propagate through the scales is a very difficult problem.

General Approaches

One of the limiting factors is how much computer time it takes to solve the model at each scale. Approaches for speeding up existing models (via multigrid or reducing the degrees of freedom) start to merge with methods for generating models for higher scales. One approach is systematic upscaling, based on multigrid and renormalization ideas. See Principles of Systematic Upscaling (pdf).

Another approach is called Heterogeneous Multiscale Modeling (HMM). See this overview page and a review article (pdf).

Connections to QMC

Given that this blog is focused on QMC, how might these multiscale ideas relate to QMC? Solving the electronic structure problem forms the lowest level of the hierarchy (for practical engineering problems, anyway). Can we speed up QMC calculations, possibly through creating reduced-order models? Also, can existing information (previous runs on the same system, or information from higher levels) be used to speed the calculation?

Using the density in QMC

2015-08-21T09:10:00.000-05:00

A while ago, I made a small post wondering about Levy constrained search in QMC?
That post was idle speculation, but there have been some papers published seriously exploring the idea for combining DFT and QMC.

The series of papers:
[1] Electronic Energy Functionals: Levy-Lieb principle within the Ground State Path Integral Quantum Monte Carlo (L. Delle Site, L. M. Ghiringhelli, D. Ceperley, May 2012)

[2] Levy-Lieb Principle meets Quantum Monte Carlo (L. Delle Site, Nov 2013)

[3] Levy-Lieb principle: The bridge between the electron density of Density Functional Theory and the wavefunction of Quantum Monte Carlo (L. Delle Site, Nov 2014)

And two related earlier papers:
Levy-Lieb constrained-search formulation as a minimization of the correlation functional (L. Delle Site, Apr 2007)
The design of Kinetic Functionals for Many-Body Electron Systems : Combining analytical theory with Monte Carlo sampling of electronic configurations (L. M. Ghiringhelli, L. Delle Site, Nov 2007)

(Dates are submission dates to arXiv, not publication dates)

Key issues when combining DFT and QMC:

How to compute density functionals from a QMC calculation?
There are two ways to go about this:

Invert the samples to find a functional.
To understand this we need to know a little bit about Orbital-Free Density Functional Theory. While the foundation of Density Functional Theory is that the ground state energy is a functional of the density, in practice it is a complicated functional (particularly the kinetic energy piece). The standard approach is to reintroduce orbitals, which can accurately compute the kinetic energy but also at a cost in performance. Finding a sufficiently accurate functional without needing orbitals could greatly speed up DFT calculations. Machine learning techniques are starting to be used to compute complicated functionals (see Finding Density Functionals with Machine Learning )
This approach could yield useful, transferable functionals.
Use the QMC sampling to perform the integral over the functional (see [2]). There is no need for an explicit expression of the functional.

How to use a given density in a QMC calculation?
In [1], the authors propose the use of Ground State Path Integrals (GSPI), with some additional thoughts on how to sample at fixed density in [2].
The reason for using GSPI is that the trial wavefunction affects efficiency, but not necessarily the final result, and the addition of a density constraint may speed up sampling. Compare with a VMC approach, where a minimization of the parameters at the fixed density would still be needed (and of course the accuracy would still be limited by the flexibility of the trial wavefunction).

The final proposal is to make the process iterative - use DFT to compute the density, use GSPI to compute new integrals over the functionals (as in 1B), use those to recompute the DFT density, etc.

It seems like an intriguing proposal, and I'm curious to see how well it would work on a realistic problem ([2] proposes a helium dimer as a good test case).

Video of SciPy 2011 talk

2011-10-05T23:18:00.000-05:00

The video from my SciPy 2011 talk is available here.

At the end of the talk, Andy Terrell asked a question about why this structured approach is necessary - why not just program the steps directly in a script?

I agree that the system I presented does seem very 'heavyweight'. But it feels that doing the transformations directly loses some meta-information that would be useful to keep around, such as which transformations were applied at each step. Additionally, storing the pieces of the derivation in data structures makes them potentially easier to manipulate (and emphasizes that they can and should be programmatically manipulated). However, Python's meta-tools are quite powerful, and could probably be used to manipulate derivations written directly in a script.

Switching to a different issue, the presented example effectively inlines all the functions into the integral. This is clearly not scalable in the face of larger and more complex expressions. The system needs some way to represent pieces that should transform into a function call in the generated code. I have something working now, but it needs some clean-up work.

Modeling derivations: Questions and Answers

2011-06-07T07:32:00.003-05:00

The past two posts (part I and part II) described a system (or style) for scientific programming that I've been working out. This post will have more in a Q&A format. (The questions are mostly ones I ask myself - perhaps it should better be called a 'rhetoric & rant' format.) An abstract for a talk and paper was accepted to SciPy 2011, so hopefully I can explain it clearly enough by then to get some feedback.

What is the target application? Goals?

The target goals are better enabling algorithm research and development (that is, writing a scientific code), and a flexibility to target different runtime systems.

Looking at the examples and case studies on various computer algebra system (CAS) websites, I see many cases where the user modeled some system, solved it, and then used the solution to improve or fix the system. The hard part seems to be modeling the system - this is not my target. I'm interested in problems where the second step is the hard part - and one needs to research new and better solvers.
I'm thinking in terms of a traditional scientific code, and how to automate parts of the development workflow. One big issue is how to use a CAS to assist in the development. Also I'm interested in problems that require high performance computing (HPC) - basically problems that will consume as much computing power and algorithm research as possible. This likely means the code will need to be tuned and altered to support different target systems (multicore workstations, distributed memory parallel clusters, accelerators like GPU's, etc).
Why not use a computer algebra system directly? It looks like you're creating a lot of extra work for yourself?
I would like a separation between the specification of an algorithm (for scientific computation, this is largely expressed in equations) and its implementation in the runtime system (along with an automated transformation from the spec to the final system).

Typically a CAS is associated with a full-featured programming language, but implementing the algorithm in that language doesn't preserve this separation.

As a concrete example, a CAS typically includes facilities for performing numerical integration. What if I want to develop a new numerical integration method - are there any examples on how a CAS can be used to do that?

Another issue with implementing HPC-style problems in a CAS is what happens when the existing solvers in the CAS are too slow, or the CAS language is too slow (e.g., how to scale to large parallel systems). The CAS may do a great deal of work for you, but what happens when you want to move past it? This is the part where code generation comes in.
What is the role of code generation in this system?
More generally, the 'code generation' step is the 'transformation to the target runtime system' step. The transformation may be trivial, if the final equations can be solved by the CAS facilities, but I'm anticipating the need to ultimately target C/C++ or Fortran.
Also, it seems useful to introduce this method in small pieces into an existing code (converting an entire code would usually be too dramatic a change to be successful). It would smooth the transition if the generated code fit the style of the existing code. A template system for generation would be a useful approach. A text-oriented system (or a basic token-oriented one, like the C pre-processor) would work, but such systems have problems. A system oriented around an actual parse tree of the target language would allow the most powerful transformations.
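
As a rough sketch of "one specification, several runtime targets", the following uses SymPy's `ccode` printer and `lambdify`. The Lennard-Jones potential is only an assumed example expression, and this shows expression-level generation only; it says nothing about the templating or parse-tree machinery discussed above.

```python
import sympy as sp

r, eps, sigma = sp.symbols("r epsilon sigma", positive=True)

# Specification: Lennard-Jones potential and the force derived from it
potential = 4 * eps * ((sigma / r) ** 12 - (sigma / r) ** 6)
force = -sp.diff(potential, r)

# Backend 1: C source text for the force expression
c_source = sp.ccode(force)

# Backend 2: a Python callable built from the same expression
force_func = sp.lambdify((r, eps, sigma), force, modules="math")
force_at_contact = force_func(1.0, 1.0, 1.0)
```

Because both outputs come from the same `force` expression, a sign error can only enter once, in the specification, where it is easiest to spot.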

Is this related to reproducible research in any way?
Packages for reproducible research seem to be focused on the level of capturing parameters that are input to a scientific code, so that the results can be repeated and revisited later.
Executing a program may be reproducible, but the construction of that program is not.

Aside:
Reproducibility can be achieved through documentation or automation. The previous standard for scientific work depended primarily on documentation. With computers, automation becomes possible as a working method. (For space reasons, this analysis is very simplistic --- there is more to it.)

This work is intended to make the construction of a code automated. One advantage is in making corrections to mistakes in the derivation - fix it, push a button, and the final code is automatically updated.

This should be complementary to other systems for reproducible research.
Why are you doing this?
I'm tired of tracking down missing minus signs and misplaced factors of 2 in my code. A significant feature of this approach is that the entire chain of derivation and generation of the program is performed by machine, to eliminate transcription errors. (If this sounds too rosy, rest assured there are plenty of other types of errors to make).

Example of modeling derivations

2011-04-23T23:19:00.002-05:00

The last post discussed a style of scientific programming that modeled derivations symbolically to create a machine-checked chain from starting equation to final code.

This post will provide an example using the configuration integral for the canonical ensemble in statistical mechanics. I want to evaluate the integral using various methods, starting with grid-based integration methods. Using a grid-based method suffers from the curse of dimensionality, where adding particles raises the time to compute exponentially (which is why Monte Carlo methods are normally used). However, it can still be useful to check Monte Carlo codes with other integration methods for small particle numbers. Also, the free energy (and all associated quantities) is computed directly, where it is more difficult to compute with MC methods. Finally, I'm curious as to how large a system could be computed on today's hardware.

Sympy is the computer algebra system I'm using. The code for this example is available on github in the derivation_modeling branch, in the prototype subdirectory.

The output (MathML and generated python) can be seen here.

The flow is as follows:

Derive the composite trapezoidal rule, starting from the expression for a single interval

Generate code for the composite trapezoidal rule

Start from the configuration integral (which I've called the partition function in the documents) and transform it by specializing various parameters - 2 particles in 2 spatial dimensions, using the Lennard-Jones interaction potential, etc - until it becomes completely specified numerically.

Generate code that evaluates this integral using the generated integration code from step 2
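
To give a feel for what the generated code amounts to, here is a hand-written sketch of the first and last steps. It is not the generated output. The Boltzmann-factor integrand is a deliberate simplification (one relative coordinate in 1D, reduced units with epsilon = sigma = 1) rather than the 2-particle, 2-dimensional case above, and the integration limits are assumptions.

```python
import math

def trapezoid_single(f, a, b):
    """Trapezoidal rule on one interval [a, b]."""
    return 0.5 * (b - a) * (f(a) + f(b))

def trapezoid_composite(f, a, b, n):
    """Composite rule: sum the single-interval rule over n equal subintervals.

    Interior points are shared by two neighbouring intervals, which gives
    h * (f(a)/2 + f(x_1) + ... + f(x_{n-1}) + f(b)/2).
    """
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return h * total

def boltzmann_factor(r, beta=1.0):
    """exp(-beta * V_LJ(r)) in reduced units (epsilon = sigma = 1)."""
    inv6 = (1.0 / r) ** 6
    return math.exp(-beta * 4.0 * (inv6 * inv6 - inv6))

# Sanity check on a known integral: sin(x) over [0, pi] is exactly 2
sine_estimate = trapezoid_composite(math.sin, 0.0, math.pi, 1000)

# Simplified 1D configuration-style integral over the relative coordinate
config_estimate = trapezoid_composite(boltzmann_factor, 0.5, 5.0, 2000)
```

The lower limit is kept away from r = 0 to avoid a division by zero; the integrand is already negligibly small there because of the repulsive core.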

There's still a tremendous amount of work to do on this, but hopefully an outline of this style of programming is visible from the example.

The MathML pages have a lot of detailed small steps - it would be nice to have a collapsible hierarchy so all the little steps can be hidden.

Some abstractions should remain during the steps - in particular the potential. The expression for the integrand can get quite messy from explicitly inserting the expression for the potential (and this is a simple case - more complex potentials would make it unreadable).

More backends for the code generation - C/C++ is the obvious choice in pursuit of higher performance. (The generated python can be run through Shed Skin, for another route to C++). Converting to Theano expressions might be a quick way to try it out on the GPU.

Modeling derivations

2011-04-07T01:55:00.002-05:00

A previous post describes how I kept reference versions of important derivations in a document. I would like to move the idea out of LaTeX and into some format more amenable to machine checking. To do this requires some structure for describing and modeling a derivation.

A derivation starts with some base equation, then proceeds through a series of steps transforming it into a form that can be solved numerically (A symbolic/analytic solution would be nice, but not possible in most cases.) The steps can be divided into three categories

Exact transformations - rearranging terms, replacing an expression with a different form, the usual steps involved in a proof, etc

Approximations - using a truncated series expansion, replacing a derivative with discretized version, etc.

Specializations - Specifying the physical or model parameters in order to make the equation numerically solvable - like the number of spatial dimensions, number of particles, interaction potentials, etc.
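
One possible way to record these three kinds of steps as data, sketched with SymPy. The `Step` and `Derivation` classes are hypothetical names invented for this illustration (they are not the prototype's actual classes), and the Gaussian integrand is an assumed toy problem.

```python
from dataclasses import dataclass, field
from typing import List

import sympy as sp

@dataclass
class Step:
    kind: str           # "exact", "approximation", or "specialization"
    description: str
    result: sp.Expr     # expression after this step

@dataclass
class Derivation:
    expr: sp.Expr
    steps: List[Step] = field(default_factory=list)

    def apply(self, kind, description, transform):
        """Apply `transform` to the current expression and record the step."""
        self.expr = transform(self.expr)
        self.steps.append(Step(kind, description, self.expr))
        return self

x, beta, L = sp.symbols("x beta L", positive=True)

d = Derivation(sp.exp(-beta * x**2))
d.apply("approximation", "truncate the Taylor series in x",
        lambda e: sp.series(e, x, 0, 6).removeO())
d.apply("exact", "integrate term by term over [0, L]",
        lambda e: sp.integrate(e, (x, 0, L)))
d.apply("specialization", "set beta = 1 and L = 1/2",
        lambda e: e.subs({beta: 1, L: sp.Rational(1, 2)}))

final_value = float(d.expr)
kinds = [step.kind for step in d.steps]
```

Keeping the `kind` with each step means the list of approximations made along the way can be extracted mechanically, which is exactly the information that tends to get lost in a hand derivation.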

At the end of the derivation, the result can be evaluated numerically using the CAS (Computer Algebra System - I assume this derivation modeling has been implemented using one) facilities, if simple enough. More complex or time-consuming results (a more likely case) can be further transformed by code generation to C/Fortran (or some other execution system).

One important feature of this workflow is every step in creating the program is checked by machine. This should greatly reduce transcription errors (dropped minus signs and factors of 2 - this is what started the whole project in the first place). Also it moves part of the process of creating scientific programs into the realm of 'ordinary software', and as such makes it subject to normal software engineering techniques - unit testing, etc. Testing scientific (numerical) software is a difficult problem - this should help make it (a little) easier and more reliable.

Some other work I've run across related to describing mathematics:

Structured derivations were developed for teaching proof in high school mathematics.

OMDoc is a markup language for describing mathematical structures at a higher level than a single formula (a proof, for instance)

Levy constrained search in VMC?

2009-08-27T00:19:00.000-05:00

The Levy constrained search formalism is a useful way to reason about density functional theory. It finds the minimum energy using two nested searches. The inner search seeks the minimum energy wavefunction for a particular density. The outer search seeks the density with the minimum energy.

This could be directly implemented using VMC. The main ingredient is energy minimization subject to a density constraint. Has anyone done this?

It wouldn't be very efficient compared to optimizing the VMC energy directly, but it would make an interesting connection between the methods.

Why scientific computations fail

2008-12-14T14:49:00.002-06:00

(I was busy this fall with another project. Now it's time to return to scientific programming)

The previous post covered several levels of tests for assessing the correctness of scientific computation. Now let us look at reasons why the results might not match what is expected.

Programming errors
Mistakes in derivation and transcription of equations
Violations of assumptions or conventions
Errors from floating point
Errors and undesirable behavior from the algorithm (often problem-dependent)
- Algorithms that have been well-studied by numerical analysts
- Those that have not

Programming errors

The usual sorts of problems in programs (array indices, memory problems, etc). This category of problems can be tested with unit tests and can make use of tools and practices from mainstream software engineering.

Mistakes in derivation and transcription of equations

This is similar to the implementation not matching the specification in traditional software engineering. With scientific computation, though, the "specification" could be more precise in principle (equations and algorithms), and computer algebra systems could aid with performing and verifying steps in the derivation.

Violations of assumptions or conventions

Usually happens during derivation of the equations. A stronger connection between the derivation and the final code might help this.

Errors and instabilities from floating point

Accumulated round-off, loss of precision from subtracting nearly equal numbers, and other problems. For more subtleties than you ever wanted to know about, visit W. Kahan's site.
One of the most annoying parts about floating point is the lack of a good model for understanding (and analyzing) floating point errors. I will return to this issue in a future post.
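
A tiny illustration of loss of precision from subtracting nearly equal numbers, using only the standard library. The choice of 1 - cos(x) as the example and the value of x are assumptions for illustration.

```python
import math

def one_minus_cos_naive(x):
    """Direct formula; subtracts two nearly equal numbers for small x."""
    return 1.0 - math.cos(x)

def one_minus_cos_stable(x):
    """Algebraically identical form, 2*sin(x/2)**2, with no subtraction."""
    s = math.sin(0.5 * x)
    return 2.0 * s * s

x = 1.0e-8
naive = one_minus_cos_naive(x)    # cos(x) rounds to (nearly) 1.0, so most digits are lost
stable = one_minus_cos_stable(x)  # close to the true value x**2 / 2 = 5e-17
```

Both functions are mathematically the same, which is what makes this category of error hard to catch by reading the equations alone.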

Errors and instabilities from the algorithm.

This category might be better divided into two subcategories - algorithms that have been well-studied by numerical analysts (like matrix operations, certain ODE's and PDE's, optimization), and those that have not. Many algorithms have been well-studied and have nice packages available to use as black-boxes. However, most scientific programs are a composition of well-known algorithms along with some problem-dependent features as well.

Often the behavior of the algorithm depends on the details of the physical system (i.e., is it a metal or an insulator?).

This category also includes science-based and numerical approximations.

Now, how do we go from test case failure to deciding which of these categories the problem falls into (and then, of course, to locating the precise problem)?

Unit testing will help with the first category. But not so much with the rest. Unit testing is helpful when you want to make sure things haven't changed. The hard part is getting it right the first time.
The different levels of tests in the previous post should give some insight into which category the problem might fall into.

First I want to look at various models of floating point error and see if we can gain some understanding of the intrinsic uncertainty in a calculation. This seems to be a necessary prerequisite for answering the question of 'how close is close enough?', in order to decide whether a test has actually failed or not.
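
As a very small starting point for the 'how close is close enough?' question, here is a sketch using the standard library (`math.ulp` needs Python 3.9 or later). The tolerance value is an arbitrary assumption; choosing it well is the actual problem.

```python
import math

# Accumulated round-off: ten additions of 0.1 do not give exactly 1.0
total = 0.0
for _ in range(10):
    total += 0.1

exactly_equal = (total == 1.0)
close_enough = math.isclose(total, 1.0, rel_tol=1e-12)

# Size of the discrepancy measured in units in the last place of 1.0
ulps_off = abs(total - 1.0) / math.ulp(1.0)
```

An exact comparison reports a failure here even though nothing is wrong with the computation, while a tolerance-based comparison passes. Expressing the discrepancy in ulps gives a scale-free way to judge how much error a calculation has picked up.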

Assessing the correctness of scientific computation

2008-07-13T21:54:00.000-05:00

In the previous post I promised some thoughts on assessing the correctness of scientific computation.

Computational science programs are typically assessed with a series of test cases, ranging from order-of-magnitude estimates ("those numbers don't look right") to comparing with experimental data. I've attempted to divide these test cases into categories, where each category shares similar expectations as to what 'close enough' means and has similar implications for the potential causes of discrepancies.

Order of magnitude estimates from general principles, back of the envelope calculations, or experience with similar problems.
Analytically solvable cases.
Small cases that are accurately solvable
For small systems, or simple cases, there are often multiple solution methods, and some of them are very accurate (or exact). The methods can be compared for these small systems. An important feature of this category is any approximations can be controlled and the effects made arbitrarily small.
Results from the same (or similar) algorithm
Comparing the same system with the same algorithm should yield close results, but now there is additional uncertainty in the exact implementation, and the sensitivity to input precision (particularly when comparing results from a journal article)
Results from a different algorithm
This looks similar to the second category (comparing with exact or more precise algorithms), but this is the case where there are multiple methods (with different approximations) for handling larger systems. There are likely to be more approximations (and probably uncontrolled approximations) involved.

(In electronic structure there is Hartree-Fock (HF), several post-HF methods, Density Functional Theory, and Quantum Monte Carlo. And possibly a few others.) Now one has to deal with all the above possibilities for two programs, rather than just one.
Experimental results
In some ways this is quite similar to the previous category, except determining "sufficiently" close requires some understanding of the experimental uncertainties and corrections in addition to possible program errors.

The testing process for each case involves running the test case and deciding whether the program result is 'close enough'. If it is, proceed to the next test case. If it is not, work to discover the cause of the discrepancy. Next post I'll look at a list of possible causes.
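As a rough illustration of the 'close enough' decision, here is a minimal Python sketch of two common checks: a relative/absolute tolerance for deterministic results, and a check against a statistical error bar for Monte Carlo results. The function names and the default tolerances are assumptions made for this example; choosing defensible values for them is exactly the open question.

```python
import math


def close_enough(computed, reference, rel_tol=1e-9, abs_tol=0.0):
    """Deterministic check: agreement within a relative or absolute tolerance."""
    return math.isclose(computed, reference, rel_tol=rel_tol, abs_tol=abs_tol)


def within_error_bars(computed, reference, std_error, n_sigma=3.0):
    """Stochastic check: agreement within n_sigma standard errors."""
    return abs(computed - reference) <= n_sigma * std_error


if __name__ == "__main__":
    # Rounding makes 0.1 + 0.2 differ from 0.3 in the last bits.
    print(0.1 + 0.2 == 0.3, close_enough(0.1 + 0.2, 0.3))
    # A Monte Carlo style comparison using an estimated standard error.
    print(within_error_bars(1.02, 1.0, std_error=0.01))
```

Neither function says anything about why a test failed; they only make the pass/fail criterion explicit so it can be examined.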

Engineering scientific software

2008-07-12T01:20:00.002-05:00

Some pointers from Greg Wilson on engineering scientific software.

The first pointer is to the "First International Workshop on Software Engineering for Computational Science and Engineering".

A couple of Wilson's short summaries of papers from this workshop
(from recent research reading)

Diane Kelly and Rebecca Sanders: “Assessing the Quality of Scientific Software” (SE-CSE workshop, 2008). A short, useful summary of scientists’ quality assurance practices. As many people have noted, most commercial tools aren’t helpful when (a) cosmetic rearrangement of the code changes the output, and (b) the whole reason you’re writing a program is that you can’t figure out what the answer ought to be.

Judith Segal: “Models of Scientific Software Development”. Segal has spent the last few years doing field studies of scientists building software, and based on those has developed a model of their processes that is distinct in several ways from both classical and agile models.

What I found interesting:

Kelly and Sanders suggest that the split between verification and validation is not useful, and propose 'assessment' be used as an umbrella term. (I will return to the point in a later post)

The second pointer is to the presentations at TACC Scientific Software Days.

One of the presentations there was by Greg Wilson himself on "HPC Considered Harmful". There is one slide on testing floating point code (slide 15). In particular the bullet point testing against a tolerance is "superstition, not science". (And he's raised this point elsewhere)

This statement has bothered me since I first read it, and so the next few posts will explore some of the context surrounding this question. The first post will list different methods for assessing the correctness of a program, the next post will categorize possible errors, and finally a post looking at 'close enough'.

Computing an unbiased inverse

2008-05-29T22:23:00.000-05:00

In a previous entry, Reweighting is biased, I had discussed the problem of a finite-sample size bias in reweighting. In this entry I will investigate a possible solution.

But first, I had asked if this was known in the literature. A comment by Lee Warren pointed to this article, Population size bias in descendant-weighted diffusion quantum Monte Carlo simulations, which discusses this very issue (in a slightly different context, but the root issue is the same). In that context the problem is even more severe since the straightforward solution (increase N) carries a heavier computational cost.

One of the crucial issues is finding a good estimator for the inverse (1/D). The naive estimator (simply do the division) is biased. So the question is: are there any other ways of performing division that could be used? Yes, in digital arithmetic there are a couple of algorithms for division using only multiplication and addition - Newton-Raphson and Goldschmidt (see the Wikipedia entry on digital division).

In general, to have an unbiased estimator, each sample of the random variable must enter the estimator linearly (alternately, we can say each sample should only be used once and then discarded)

The Newton-Raphson iteration for the inverse (1/D) is x_{i+1} = x_i (2 - D x_i). To make this unbiased, in each iteration the value of D needs to be an independent sample, and the values of x_i from previous iterations also need to be computed from independent samples. One can think of the algorithm forming a tree, with each iteration branching to two children (each child is one of the x_i computed by the previous iteration), until reaching the leaves representing the initial iteration (x_0 = T_1 + T_2 D).

I tested this algorithm, and it appears to give unbiased results. There is a catch, though. The result has a large variance. Using the average of all the samples with the naive estimator yields a bias that is much smaller than the variance of the unbiased estimator (hence the mean squared error is a more appropriate metric for the combination of variance and bias)
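A minimal sketch of the tree-structured iteration, assuming a hypothetical `sample_d()` callable that returns an independent sample of D on every call. The function names are made up for this example. Using both children for the linear term (`a + b` rather than `2*a`) and taking T_1 = 48/17, T_2 = -32/17 (the usual linear initial guess when D is scaled into [0.5, 1]) are illustrative choices, not necessarily what was used in the test described above.

```python
import random

# Coefficients of the linear initial guess x_0 = T1 + T2*D (assumed values).
T1 = 48.0 / 17.0
T2 = -32.0 / 17.0


def unbiased_inverse(sample_d, depth):
    """Stochastic Newton-Raphson estimate of 1/E[D].

    Every sample of D enters linearly: each node of the tree draws its own
    D, and its two children are built from disjoint sets of samples.
    """
    if depth == 0:
        return T1 + T2 * sample_d()
    a = unbiased_inverse(sample_d, depth - 1)  # first independent x_i
    b = unbiased_inverse(sample_d, depth - 1)  # second independent x_i
    d = sample_d()                             # fresh sample of D
    # x_{i+1} = x_i (2 - D x_i), with the square replaced by the product a*b
    return a + b - d * a * b


def naive_inverse(sample_d, n):
    """Biased estimator: average n samples of D, then divide."""
    return n / sum(sample_d() for _ in range(n))


if __name__ == "__main__":
    rng = random.Random(1)

    def draw():
        return rng.uniform(0.7, 0.8)

    trials = [unbiased_inverse(draw, 3) for _ in range(10000)]
    print("tree estimator mean :", sum(trials) / len(trials))
    print("naive, 15 samples   :", naive_inverse(draw, 15))
```

A tree of depth k consumes 2^(k+1) - 1 samples of D, which is one way to see where the large variance per sample comes from.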

Open questions/issues:

Is there some modification of the algorithm that will produce a smaller MSE than the naive estimator? Or even an MSE that isn't too much larger?

How many iterations are necessary - how to analyze the convergence of this stochastic Newton-Raphson iteration?

Try out the Goldschmidt algorithm.

Reproducible Research

2008-04-06T00:56:00.000-05:00

There were a couple of recent posts on Greg Wilson's blog about reproducible research. The first is a pointer to a paper about WaveLab. The paper is from 1995, although it appears the philosophy hasn't changed much looking at a more recent (2005) introduction on the website. The second post contains a reference to the Madagascar project.

These projects follow the work of Jon Claerbout on Reproducible Research.

Thoughts

This work is all about designing a workflow for computational research, and creating tools to support that workflow. Much of it seems quite banal, but part of the whole purpose of doing this is to free you from having to think about the boring and repetitious steps.
Levels of reproducibility
1. Yourself, right away (when tweaking figures, for instance)
2. Another research group member, or yourself 6 months later
3. Someone in another group, (a) able to run the code and get the same results, and (b) able to understand the code
4. Someone in another group, able to use your description to write their own code that gets the same results
The last level is the gold standard in scientific reproducibility. Experiments are reproduced from published descriptions (ideally, although the descriptions aren't always complete). Or theorists rederive intermediate steps leading to a paper's conclusion.

In software development, levels 1-3 are what counts. (It's not very useful software development if others can't run your code and get expected results.) The last level is similar to multiple implementations of a standard (e.g., C++ compilers from multiple vendors).

The referenced works seem to focus mostly on processing of raw data (images, seismic imaging, etc) and the processing steps to produce a final result.

How would this apply in other fields? Say, ones with smaller input data sets, but much higher computational demands.

The Madagascar project, for example, uses SCons to coordinate the steps in producing research output. What happens when one of the steps involves several weeks of supercomputer time? Two immediate problems that spring to mind - one probably doesn't want this step launched without warning by the script. Furthermore, the steps for launching such a job are likely to be very site-specific, making it difficult for others to reproduce elsewhere.

Somewhat related entries of mine: Keeping notes and Solving Science Problems

Notes from PyCon 2008

2008-03-19T22:50:00.001-05:00

I attended PyCon 2008 this past weekend, and had a good time meeting people in the Python community.

Some notes

I tried to give a lightning (5 minute) talk about sympy and code generation. Unfortunately, Ubuntu 7.10 on my laptop (Dell Vostro) failed to display to the external output, so I was unable to present to the larger audience. However, I was able to give the talk to the people at the science BOF session. The slides are available here

At the science BOF session I met Michael Tobis, the author of an interesting system called PyNSol. It uses Python to describe the differential equations, and generates Fortran for performance.

The python 2 to 3 converter looks interesting. Not so much for the conversion of python versions, but for the infrastructure developed for the tool.

Enthought was showing off some multi-touch displays. Being able to interact with 3D visualizations (and Google Earth) via gestures for zoom, pan, and rotation made for a nice demo. The real question is if this type of display would be useful for actual use.

Code generation now uses templates

2008-03-17T20:34:00.000-05:00

In Quameon, the derivatives for the Gaussian basis functions are computed with sympy and the Python code is generated. Creating the code that surrounds the generated code is a bit of a tedious procedure - the syntax tree needs to be constructed by hand. But no more! The surrounding code can now be written in Python and a parser (written using pyparsing) will create the syntax tree for you.

Some details:

Identifiers surrounded by percent signs (eg %name%) get turned into a template node type for subsequent replacement.

Example template:


class %orb_name%:
  def compute_value(self,alpha,v,r2):
    x=v[0]
    y=v[1]
    z=v[2]
    %value%
    return val

The code replaces %orb_name% with the orbital name (s, px, py, pz, etc.) and %value% gets replaced with the expression for the value of the orbital (the methods for the derivatives are not shown - see codegen/primitive_gaussian/deriv.py in SVN for the full template).
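To show the substitution idea in isolation, here is a simplified sketch that works on plain strings with a regular expression. This is not how Quameon does it (there the template is parsed into a syntax tree with pyparsing); the `fill_template` helper and the unnormalized Gaussian used for %value% are assumptions made for this example.

```python
import math
import re

TEMPLATE = """class %orb_name%:
  def compute_value(self,alpha,v,r2):
    x=v[0]
    y=v[1]
    z=v[2]
    %value%
    return val
"""


def fill_template(template, values):
    """Replace every %name% marker with values[name] (KeyError if missing)."""
    return re.sub(r"%(\w+)%", lambda match: values[match.group(1)], template)


if __name__ == "__main__":
    code = fill_template(
        TEMPLATE,
        {"orb_name": "s", "value": "val = math.exp(-alpha * r2)"},
    )
    print(code)
    namespace = {"math": math}   # the generated code refers to math.exp
    exec(code, namespace)        # defines class "s" in namespace
    print(namespace["s"]().compute_value(1.0, [0.0, 0.0, 0.0], 0.0))
```

String substitution is enough for a demo, but working on a syntax tree makes it possible to insert multi-statement expressions and to retarget the output to another language.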

Why python?

2007-12-21T08:31:00.000-06:00

The author of the first comment on the Quameon entry wonders if Python will be too slow.

One reason to use Python is ease of programming. Hopefully the advantages of programming in Python will outweigh the performance penalty.

However, at some point, runtime speed will be an issue. Here's a list of steps to improve runtime performance (in increasing order of effort/expected return)

Use psyco. Put "import psyco; psyco.full()" at the beginning of your program and get an instant speed increase.
Use the Shed-Skin Python to C++ compiler. (I haven't tried this yet - don't know if it works with Quameon.)
Profile and rewrite slow parts in C/C++/Fortran. I've done this with the inter-particle distance computation in a classical MC code in Python, and gained some performance.
Code generation. Right now, the code for the Gaussian basis functions is auto-generated (no more computing derivatives by hand!!). I would like to move more of the basic formulas to this technique. Then it should be easier to retarget the code generation to another language (in theory, I haven't tried it yet).
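For the profiling step, a minimal sketch using the standard library's cProfile on a pure-Python inter-particle distance loop. The `pair_distances` and `profile_call` functions are stand-ins written for this example, not code from Quameon or from the classical MC code mentioned above.

```python
import cProfile
import io
import math
import pstats
import random


def pair_distances(positions):
    """Distances between all distinct pairs of particles (O(N^2) loop)."""
    n = len(positions)
    distances = []
    for i in range(n):
        for j in range(i + 1, n):
            distances.append(math.dist(positions[i], positions[j]))
    return distances


def profile_call(func, *args):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


if __name__ == "__main__":
    rng = random.Random(0)
    particles = [(rng.random(), rng.random(), rng.random()) for _ in range(300)]
    _, report = profile_call(pair_distances, particles)
    print(report)
```

A loop like this one is a typical candidate for rewriting in a compiled language once the profile confirms it dominates the run time.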

Other tidbits:

Monte Python uses Python and C++ for QMC (paper, code). They wrote the time-consuming parts in C++, and use Python to glue it all together. The algorithm variants they tested (parallel distribution strategies for DMC) were achieved through changing the Python part of the code.

Code generation could potentially result in *faster* code than writing a general framework in C++ or Fortran. This is because the code can be specialized for a given physical system, exposing more optimization opportunities. For an example of code generation, look at the development SVN area of PyQuante - there is a program that will produce a C++ program to compute the energy for a molecule given in the input file.

Quameon

2007-12-05T23:32:00.001-06:00

I've been working on a Quantum Monte Carlo code implemented in Python. It's called "Quameon" (the name is a mixture of letters from "Quantum Monte Carlo in Python"). There's a SourceForge project for it - here are links to the SourceForge project page, the project web page, and the project wiki.

The code doesn't do much currently, but I mention the project now so I can write blog entries about various aspects of getting the code working.

I'm currently trying to validate that the code works by using HF single particle orbitals (obtained from PyQuante) and no Jastrow factor - this should reproduce the HF energy. It seems to work for a few atoms (He, Be, C), after sorting out some normalization issues with the basis sets (apparently not all basis sets are normalized - this doesn't affect the energy, but does affect the eigenvector coefficients). I will need longer runs to be sure. The next step is to profile the code to look for any quick and easy performance tuning opportunities.

The first target (after getting the code running reliably) is to implement several forms for jastrow factors and several VMC parameter optimization algorithms.

Keeping notes

2007-10-04T22:38:00.000-05:00

Keeping a record of work is important for research. I use two main tools for this purpose. First, a text file with entries in chronological order (one file per month, with names like "Sept2007"). I write the date and then start taking notes for that day in the file.

Pro - easy to use, chronological view easily available
Con - math hard to write well, ideas spread over several days hard to view topically.

The other tool I use is a wiki. It's useful to keep data in a structured, hierarchical format that's easy to edit.

Pro - ideas arranged topically, can do better looking math
Con - must work topically from the start, no good chronological view (Yes, I know you can view changes in order, but it's not as easy to reconstruct what I was thinking on a particular day.)

I want a single tool that does both of these - that can view the notes both chronologically and by topic.
For work-flow, I want the ability to enter ideas and notes free-form, and perform categorization and annotation at a separate time.

Another additional feature that would be nice is recording other actions on the system (at least by reference).
One example is check-ins to source control.

Another possible example - What if gnuplot were to record all the commands issued (and kept a copy of all the files used?) Graphs could be re-created and played back at a later date easily.

I looked around a little bit to see if there are other applications that might meet some of these requirements (or at least be an improvement over text files and a wiki). One class of app is a note taking application. I looked at Tomboy. It does have a LaTeX plugin (though I could not get it to work). LaTeX side-rant: It's great for making nice looking mathematics, but you wind up with "equations under glass". It's pretty to look at, but you can't manipulate it further. For mathematical content, I want something with more precise semantics (Content MathML, for example).

Another class of tool is mind mapping software. I looked at Vym (View your mind) briefly, but I don't think I will like this type of software. Viewing a graph of links is nice, but not as the main working view - I prefer a wiki view (pages with links)

Black Box Reweighting

2007-08-19T21:56:00.000-05:00

In the last post, I wondered whether the black box reweighting method might help with the bias.

The Black Box Reweighting (BBRW) method computes weights using the observed distribution of samples, rather than the targeted distribution. (The author's website contains a link to supplementary material with further discussion.)

I ran a simple test, and the bias was slightly worse with the BBRW method, compared to standard reweighting.

The BBRW method does a fine job of solving the problem the authors intended (correct handling of a distribution that may be incompletely sampled, and lower variance of the result - the lower variance of the result was apparent in my tests), but it doesn't solve the reweighting bias issue.

The search for a method to correct reweighting bias continues...

Reweighting is biased

2007-08-14T22:52:00.000-05:00

The reweighting technique, where samples from one probability distribution are adjusted with a weight factor to compute averages from a different (but usually very similar) distribution, is biased for a finite number of samples.

The details can be found here.
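For concreteness, a minimal sketch of the reweighted (self-normalized) average. The estimator is a ratio of two sample averages, and the expectation of a ratio is not the ratio of the expectations, which is where a finite-sample bias can enter. The zero-mean Gaussian example and all the names below are assumptions made for this illustration.

```python
import math
import random


def reweighted_average(samples, f, weight):
    """Estimate <f> under the target distribution from samples of another one.

    weight(x) is the ratio target(x)/sampled(x); constant factors cancel
    because the same weights appear in numerator and denominator.
    """
    numerator = sum(weight(x) * f(x) for x in samples)
    denominator = sum(weight(x) for x in samples)
    return numerator / denominator


def gaussian_weight(sigma_from, sigma_to):
    """Weight (up to a constant) between two zero-mean Gaussians."""
    def weight(x):
        return math.exp(-0.5 * x * x * (1.0 / sigma_to**2 - 1.0 / sigma_from**2))
    return weight


if __name__ == "__main__":
    rng = random.Random(0)
    weight = gaussian_weight(sigma_from=1.0, sigma_to=1.2)
    n_samples, n_repeats = 10, 20000
    estimates = []
    for _ in range(n_repeats):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        estimates.append(reweighted_average(xs, lambda x: x * x, weight))
    # The exact target value of <x^2> is sigma_to**2 = 1.44.
    print("mean of small-sample estimates:", sum(estimates) / n_repeats)
```

Averaging many small-sample estimates, as the demo does, separates the bias from the statistical noise of any single estimate.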

Question for the audience: Is this known? I don't recall seeing this mentioned any time the method has been presented.

Some possible routes to fix the problem are listed in the paper, but alas, I haven't been able to turn any of them into a useful solution yet.

VMC optimization papers

2007-08-10T22:29:00.000-05:00

A recent comment suggested these papers on VMC optimization:

Alleviation of the Fermion-sign problem by optimization of many-body wave functions

Wave function optimization in the variational Monte Carlo method

I'd also add these to the pile of papers to read and understand:

Optimization of quantum Monte Carlo wave functions by energy minimization

A simple and efficient approach to the optimization of correlated wave functions

And since this post is already about VMC optimization, some further questions:

What about simultaneous geometry and parameter optimization?

The focus seems to be on single geometries - what's likely more interesting in the future is a family of geometries - has anyone parameterized the VMC parameters (by the bond lengths or nuclear coordinates)?

Correct estimators in DMC without forward walking?

2007-08-10T21:09:00.000-05:00

Computing properties other than the energy using Diffusion Monte Carlo requires forward walking to get correct answers. That may change, if this paper - Hellman-Feynman operator sampling in Diffusion Monte Carlo calculations - is correct.

Although as I look at the paper more, the technique doesn't look any easier than the forward walking implementation by Casulleras and Boronat (the paper does mention this similarity). It appears to use similar data in a different combination. I would be interested to see how the noise, bias, and stability characteristics compare.

The Hellmann-Feynman theorem seems to be getting lots of attention in the QMC world recently. It's also been used to construct Zero-Variance Zero-Bias estimators (most recent example here).

Gravitational N-body and the Art of Computational Science

2007-08-02T22:35:00.000-05:00

A rather ambitious project to describe a gravitational N-body code and how to write it in Ruby: The Art of Computational Science

As Greg Wilson points out, something like this would be nice to have in Python.

Efficient Global Optimization

2007-03-17T23:51:00.000-05:00

The optimization method mentioned in the last post is an improvement on Efficient Global Optimization, presented in the paper Efficient Global Optimization of Expensive Black-Box Functions (also available from one of the authors' websites).

Another paper listed in the references, A Taxonomy of Global Optimization Methods Based on Response Surfaces (by D. R. Jones, one of the authors on the EGO paper) is a nice overview of various optimization methods. It presents each method and gives ways in which it can be fooled into choosing the wrong point as the optimum. A fix is given, which then leads to the next method in the list.

One of the core ideas is regression based on Gaussian processes (kriging). To learn more about Gaussian processes, see www.gaussianprocess.org. The two Jones papers listed above also show how the GP based methods fit in with other basis set methods.
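EGO picks the next point to evaluate by maximizing the expected improvement under the Gaussian process prediction. Here is a minimal sketch of that criterion for a minimization problem, assuming the GP fit has already produced a predicted mean `mu` and standard deviation `sigma` at a candidate point. The function and argument names are illustrative, and the GP fit itself is not shown.

```python
import math


def expected_improvement(mu, sigma, f_best):
    """Expected improvement over the best value seen so far (minimization).

    EI = (f_best - mu) * Phi(z) + sigma * phi(z),  z = (f_best - mu) / sigma,
    where Phi and phi are the standard normal CDF and PDF.
    """
    if sigma <= 0.0:
        # No predictive uncertainty: the improvement is known exactly.
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (f_best - mu) * cdf + sigma * pdf


if __name__ == "__main__":
    # A point predicted to be worse than the best, but very uncertain,
    # can still have a larger expected improvement than a safe one.
    print(expected_improvement(mu=1.2, sigma=1.0, f_best=1.0))
    print(expected_improvement(mu=1.05, sigma=0.01, f_best=1.0))
```

The second term is what rewards uncertainty, so the search keeps exploring instead of only refining the current best region.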